Tidyverse: Data Wrangling 101
0. Prerequisites
0.1 Target audience
This course is targeted at Master’s and Ph.D. students who have a basic understanding of the R programming language, but want to manipulate (large) datasets with more ease than spreadsheet software allows. In other words, being comfortable with basic R operations is required.
If you can answer the following questions without much of a hassle, please move on to 0.2 Preparations:
- What does the <- operator do?
- How do you create a vector with the values "this", "is", "a", "vector"?
- How do you access R’s built-in help files (also referred to as “R documentation”) for the function shapiro.test?
- How do you install and load a new package into your R session?
- Why doesn’t the following piece of code work (run the code in an R session to check)?

foobar <- data.frame(x = 1:10, y = rnorm(10, 5, 1))
plot(FOOBAR)
In case you had trouble with any of these questions, please take some time to get comfortable with some R basics. As we have a lot of ground to cover, it would be unwise to jump in unprepared! I strongly recommend the swirl package, which interactively introduces you to R (see swirl’s website for more information). To get started with swirl right away, install the package using install.packages("swirl"), load it into your session with library(swirl) and jump-start your journey with swirl(). Going through the first chapter (1: Basic Building Blocks) should suffice, but don’t let that stop you from learning moRe!
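In code form, getting started with swirl boils down to exactly the calls mentioned above:

# install swirl, load it into the session, and start the interactive course
install.packages("swirl")
library(swirl)
swirl()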
0.2 Preparations
Please go through all of the steps below before attending the workshop.
Make sure to have at least R version 4.1.0 installed on your computer. Additionally, I strongly recommend installing RStudio, as I will be using this as my Integrated Development Environment (IDE) throughout this course.
- Installing R: https://www.r-project.org/
- Installing RStudio: https://www.rstudio.com/products/rstudio/download/#download
Finally, install tidyverse using install.packages("tidyverse").
0.3 Overview
For an overview of all sections covered within this material, please refer to the sidebar of this document or use the hyperlinks shown below. Sections 1 to 4 are not required for working with tidyverse, but are recommended to expand your understanding of how these packages work the way they do. Additionally, you’ll learn how to deal with importing data, including a section on larger datasets. Starting from section 5, we will get our hands dirty with actual tidyverse data manipulation.
1. Introduction
2. The whole game
3. Tibbles and pipes
4. Importing data
5. Data manipulation with dplyr
6. Tidy data with tidyr
7. DataViz: ggplot2
8. Common errors
9. Life cycle
10. What’s next?
11. Cheat sheets
1. Introduction
1.1 Context
As a biology student, I was introduced to R in the very first year of the programme. With R being my first scripting language, it was as much an uphill struggle as any other new language. In the second year, R was thrown on the table again in the context of statistics, with another round of RStats in the Master’s programme two years later. In all this time, I used R only as a means to perform statistical tests. As real, raw data was rarely in the format presented during any of the statistical courses, I cleaned, filtered, pivoted, … all of it using MS Excel. If you have ever done this yourself, you know it can be a very time- and energy-consuming endeavour! Indeed, we never really learned how to clean and wrangle our datasets, leading to a lot of trouble and frustration when it came to data analysis.
During my thesis, however, I found out about the ‘Tidyverse’, but never truly immersed myself in it. At the start of my Ph.D. in October 2020, I seized the moment to learn the ropes of this set of packages, and I haven’t looked back since (and learned more about R along the way as well). To potentially save you a lot of time and trouble - whether you are a Master’s or Ph.D. student, or even something beyond that - I want to share some of the things I have learned along the way. For the record, I’m far from an expert on the matter, and there is still a lot left to explore!
This origin story aside, hopefully this material will prove to be helpful somewhere along your data journey. There are many ways to deal with data tidying and wrangling, and the tidyverse just happens to be the one I prefer at the moment. Feel free to send me any and all feedback you may have to Stijn.VandeVondel@uantwerpen.be.
1.2 Tidyverse
The tidyverse is “an opinionated collection of R packages designed for data science”. In that sense, tidyverse can be represented as a virtual basket containing different packages, which “all share an underlying design philosophy, grammar, and data structures”. In other words, these packages and their corresponding functions easily interact with each other, allowing for a wide range of tools to tinker with data.
If you haven’t already, install the tidyverse (install.packages("tidyverse")) and load it into your R environment.
# load tidyverse
#install.packages("tidyverse")
library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.5 v purrr 0.3.4
v tibble 3.1.4 v dplyr 1.0.7
v tidyr 1.1.3 v stringr 1.4.0
v readr 2.0.1 v forcats 0.5.1
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
As shown above, a series of packages is attached to your session. Each of these packages is listed below, along with a brief description borrowed from the packages’ documentation. In addition to these ‘core’ packages, other R libraries are also installed along with tidyverse, but these are mostly beyond the scope of this workshop.
- ggplot2: A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”.
- dplyr: dplyr provides a grammar of data manipulation, yielding a consistent set of verbs that solve the most common data manipulation challenges.
- tidyr: tidyr provides a set of functions that help you acquire tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable. This is part of the core philosophy of the tidyverse.
- readr: readr provides a fast and friendly way to read rectangular data (like .csv, .tsv, and .fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
- purrr: purrr enhances R’s functional programming toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive.
- tibble: tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more, forcing you to confront problems earlier, typically leading to cleaner, more expressive code.
- stringr: stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.
- forcats: forcats provides a suite of useful tools that solve common problems with factors. R uses factors to handle categorical variables: variables that have a fixed and known set of possible values.
- Others: broom, cli, crayon, dbplyr, dtplyr, googledrive, googlesheets4, haven, hms, httr, jsonlite, lubridate, magrittr, modelr, pillar, readxl, reprex, rlang, rstudioapi, rvest, xml2
If you installed tidyverse earlier, you may want to check whether all of its packages are up to date.
# check for updates
tidyverse::tidyverse_update()
If a package is out of date, you will receive a notification and instructions to update outdated packages.
Try tidyverse_packages(include_self = TRUE) and see for yourself!
1.3 Package conflicts: masking
As shown in 1.2 Tidyverse, library(tidyverse) attaches multiple packages to your R session. Additionally, a couple of so-called Conflicts will be shown. As these conflicts are not exclusive to tidyverse, but become apparent when you start loading packages into R, it is important to know what exactly these conflicts entail.
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
In plain English: filter() and lag() from the dplyr package share their names with filter() and lag() from the stats package (included with any R installation). In other words, once dplyr has been attached to your R session, filter() and lag() from the stats package will no longer be directly accessible (unless called explicitly using e.g. stats::filter()). The :: in dplyr::filter() indicates that filter() originates from the namespace of dplyr. More specifically, :: allows accessing a specific package’s functions without loading the entire package into R (see ?'::' for more details).
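To see masking in action, consider the sketch below (my own example; the moving-average call is just an arbitrary use of stats::filter()):

# after attaching dplyr (via tidyverse), a bare filter() call is dplyr's
library(tidyverse)
filter(mtcars, cyl == 6)       # dplyr::filter(): keep rows where cyl equals 6

# the masked stats version is still reachable through its namespace
x <- ts(rnorm(100))
stats::filter(x, rep(1/3, 3))  # stats::filter(): 3-point moving average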
2. The whole game
2.1 In a nutshell
“Tidy datasets are all alike, but every messy dataset is messy in its own way.”
More often than not, raw data will not be formatted in a very accessible, analysis-friendly way. Experiments generally do not produce clean trees.csv or covid_cases.csv files, but datasets that are wild, exotic or downright savage. This is especially likely if someone else collected data for you, but did not have any prior knowledge about your general set-up. If you cannot recall hours of tedious data tinkering in Excel following a group lab practical, then have you really lived your student life to the fullest? ;-)
If only Robbie would stop bothering Alexa… Source: Jon Carter
To deal with such datasets, rigorous cleaning and wrangling is required before even thinking about modelling or visualizing the story residing within your data. The schematic below (as shown in R for Data Science; great reference material!) encompasses the entire process of data science with tidyverse; this course is (mostly) limited to the parts highlighted in blue. Most importantly, data tidying and transformation will be taking center stage, with some notes on importing data and dipping our toes in visualization.
Data wrangling. Source: R for Data Science
2.2 Data cleaning vs data wrangling
The above schematic in words: Raw data first needs to be imported by loading it into the R environment (which usually means effectively loading data into memory (RAM)). Once loaded, data often requires tidying (or data cleaning) and transformation (or data wrangling) prior to any analyses down the line. In more exact terms, data cleaning is the process through which errors are fixed and data quality is ensured, while data wrangling would be defined as the process through which raw data is manipulated and transformed. As tools for both processes can often be used interchangeably, I will not distinguish between these definitions. For the sake of everyone’s convenience, I will simply talk about data wrangling as a whole.
3. Tibbles and pipes
Before diving head first into the tidyverse, we need to talk about tibbles and pipes. Many functions within the tidyverse produce tibbles, making them worth at least a brief mention. Pipes, on the other hand, are a powerful tool for clearly expressing a sequence of multiple operations.
3.1 Tibbles
Tibbles are dataframes, but with a twist. For the sake of comparison and clarification, we will create a classical R data frame and a tibble with the same content.
# set seed for reproducibility
set.seed(1)
# create a dataframe
a_dataframe <- data.frame(x = 1:25,
y = rnorm(25, 1, 2))
# create a tibble from the dataframe
a_tibble <- as_tibble(a_dataframe)
Here, we have created two datasets, each containing the same information: column x with numbers from 1 to 25, and column y with 25 random observations drawn from a normal distribution. Both have the same number of variables (2) and observations (25), and will produce the same results (try running e.g. all.equal(mean(a_dataframe$y), mean(a_tibble$y))). This begs the question of what, exactly, is different?
One hint toward the answer can be obtained by running both objects (simply a_dataframe and a_tibble in the R console) and reviewing the output.
A regular dataframe (left) and a tibble (right). The tibble shows a couple of distinct features to improve printing and inspection of your data.
As shown in the figure above, a couple of features are shown in a tibble that are non-existent for ‘regular’ dataframes. In a way, a tibble is nothing more than a data frame with some extra ‘quality of life’ features. However, there are other important differences going on under the hood, encapsulating best practices for data frames. Read more on tibbles here.
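You can compare those features yourself in the console (a small sketch; the column z is deliberately non-existent):

# tibbles print compactly and informatively
a_dataframe   # prints all 25 rows, no column type information
a_tibble      # prints the first 10 rows, plus dimensions and column types

# tibbles complain more: accessing a missing column triggers a warning
a_dataframe$z # returns NULL, silently
a_tibble$z    # returns NULL, with a warning about the unknown column z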
3.2a Pipes: short version
Pipes are a special operator aimed at making code more intuitive to read and write (though opinions may differ). The tidyverse pipe is written as %>% (CTRL+SHIFT+M in RStudio) and originates from the magrittr package (to be pronounced with a sophisticated French accent). It is automatically loaded with library(tidyverse), but can also be loaded separately using library(magrittr) or library(magrittr, include.only = "%>%").
In brief, the magrittr pipe passes the output of what comes before the pipe (left-hand side) as input to the function after the pipe (right-hand side). The pseudo-code below shows what this looks like in the context of baking cookies in a factory, going through the functions and pipes as if on a conveyor belt or pipeline.
raw_ingredients <- c("butter", "sugar", "eggs", "chocolate chips", "...")
choc_chip_cookies <- raw_ingredients %>%
make_dough() %>%
shape_cookies() %>%
transport_to_oven() %>%
bake_yummie_cookies() %>%
cool_cookies() %>%
pack() %>%
send_away()
In case you don’t completely understand %>% yet, I’ve written a lengthier section below (3.2b Pipes: long version). It will help you to better grasp the benefits of the operator, but is not required to get you going with tidyverse. The bottom line remains: %>% passes what comes before the pipe to what comes after it, effectively creating a virtual pipeline of consecutive operations.
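To make that bottom line concrete with real (non-cookie) code, a minimal example of my own:

# the pipe passes its left-hand side as the first argument of the next call
c(1, 4, 9) %>% sqrt()           # equivalent to sqrt(c(1, 4, 9))
c(1, 4, 9) %>% sqrt() %>% sum() # equivalent to sum(sqrt(c(1, 4, 9)))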
3.2b Pipes: long version
To give you a working example, I will provide some code (see below) with one of the most used built-in R datasets: mtcars. Don’t worry too much about the different functions used (we will get back to most of those later!), but pay attention to the use of the %>% operator in example 1.
# example 1: piping
df_cars <- mtcars %>%
rownames_to_column("car") %>%
filter(str_detect(car, "Merc")) %>%
# convert miles per gallon -> km per liter
mutate(kml = mpg*(1.60934/3.78541)) %>%
select(car, kml, cyl, hp)
As the benefit of the %>% may not immediately become clear from this example alone, consider the equivalent code blocks below (examples 2 to 4):
# example 2: nesting
df_cars <- select(
mutate(
filter(rownames_to_column(mtcars, "car"), str_detect(car, "Merc")),
kml = mpg*(1.60934/3.78541)),
car, kml, cyl, hp)

# example 3: overwriting
df_cars <- rownames_to_column(mtcars, "car")
df_cars <- filter(df_cars, str_detect(car, "Merc"))
df_cars <- mutate(df_cars, kml = mpg*(1.60934/3.78541))
df_cars <- select(df_cars, car, kml, cyl, hp)

# example 4: base R
df_cars <- mtcars
df_cars$car <- row.names(df_cars)
df_cars <- df_cars[, c(ncol(df_cars), 1:(ncol(df_cars) - 1))]
df_cars <- df_cars[grep("Merc", df_cars$car), ]
row.names(df_cars) <- NULL
df_cars$kml <- df_cars$mpg*(1.60934/3.78541)
df_cars <- df_cars[c("car", "kml", "cyl", "hp")]
All of the above examples will yield the same df_cars object at the very end. In my humble opinion, example 1 is by far the most readable and maintenance-friendly code (but, again, mileage and opinions may vary). Once you become used to piping multiple operations together into one chain, you no longer need to save intermediate results or overwrite old data (with some exceptions that are bound to cross your path). Additionally, code becomes more intuitive and readable if used correctly.
Another thing you may (or may not) have noticed is how in example 1 (as well as in example 2, but for different reasons) df_cars is only mentioned once, while examples 3 and 4 mention the object df_cars 7 and 15(!!) times, respectively. As for example 2, nesting all of the functions limited the number of mentions of df_cars, at the cost of readability. Naturally, one could also nest some of the operations in example 4, but I think we all have better things to do!
In any case, where is each function in example 1 getting its data from, and how does it know which data to use? And what is the order of execution of these function calls? To explain this, imagine a factory that produces your favourite type of cookie - I will go with the classic ol’ chocolate chip. At one end, raw ingredients (butter, sugar, eggs, chocolate chips, …) are delivered to the factory’s doorstep. At the other end, the factory pumps out boxes chock-full of delicious cookies.
Of course, we all know the factory isn’t a black box, but rather an intricate system of many different steps. First, raw ingredients need to be mixed into a batter and thickened into a dough. Next, this dough is poured into moulding machines where the cookies are given their iconic shape. Then, the cookie-shaped dough moves down a conveyor belt to an industrial oven for baking. Finally, after cooling these heavenly cookies, they are packed and sent away. Let’s write this into some pseudo-code using the magrittr pipe:
raw_ingredients <- c("butter", "sugar", "eggs", "chocolate chips", "...")
choc_chip_cookies <- raw_ingredients %>%
make_dough() %>%
shape_cookies() %>%
transport_to_oven() %>%
bake_yummie_cookies() %>%
cool_cookies() %>%
pack() %>%
send_away()
In case you hadn’t noticed yet: the pipe passes the output of the function before it to the function after it (in the example above, the output of each line is passed as input to the next line). As such, raw_ingredients is passed on to make_dough(), and the result of make_dough() is passed on to shape_cookies(). Once the data has gone through send_away(), it is stored in the object called choc_chip_cookies - the finished box of cookies, if you will. You could also write all of the cookie-making steps on a single line of code (like a virtual conveyor belt), but this would make the code much less readable (head on over to this style guide for more info on code styling within the tidyverse).
Without going into too much detail, this behaviour is also ingrained in most tidyverse functions: they expect the data as their first argument, which is exactly where the pipe places the output of the previous step. If you want to refer to this piped-in input explicitly within a function call, the dot (.) placeholder can be used:
raw_ingredients <- c("butter", "sugar", "eggs", "chocolate chips", "...")
choc_chip_cookies <- raw_ingredients %>%
make_dough(.) %>%
shape_cookies(.) %>%
transport_to_oven(.) %>%
bake_yummie_cookies(.) %>%
cool_cookies(.) %>%
pack(.) %>%
send_away(.)
For now, this is all you need to know about pipes (and far more than I knew when I started out). In brief, the magrittr pipe passes the output of what comes before the pipe (left-hand side) as input to the function after the pipe (right-hand side). For more technical information, see here.
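The dot placeholder becomes truly useful when the piped-in data should not land in the first argument position. A minimal sketch (my own example; lm() is base R, not tidyverse):

# with a top-level dot, the left-hand side is only inserted where . appears
mtcars %>% lm(mpg ~ wt, data = .)  # identical to lm(mpg ~ wt, data = mtcars)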
NOTE: Ever since R version 4.1.0, R also sports a native pipe operator |>. Its behaviour is highly similar to that of %>%, but |> is part of the R language itself, while %>% needs to be imported from a package. For further reading, head on over to the following comparison. The |> operator will not be covered in this course.
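For reference, both pipes look like this in use (a trivial example of my own):

# equivalent one-liners; the native pipe requires R >= 4.1.0
mtcars %>% head(2)  # magrittr pipe
mtcars |> head(2)   # native pipe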
‘Ceci n’est pas une pipe’, by Belgian artist René Magritte, which served to be the etymological inspiration for the magrittr package.
3.3 Tidy data
As already touched upon in the description of the tidyr package, tidy data is data with a consistent form and follows three rules:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
The three rules of tidy data. Source: ’R for Data Science’
If data is tidy, every variable goes in a column, and every column is a variable. Tidy data is not desirable in all cases, but can prove to be a very robust way of structuring data when using the tidyverse. Those who have already worked with ggplot2 may know what I am talking about! For more information and examples, check 12.2 Tidy data in R4DS.
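To make the three rules concrete, a small made-up example (the country/year data is my own illustration; pivot_longer() returns in the tidyr section):

# untidy: one variable (year) is spread across the column names
untidy <- tibble(country = c("BE", "NL"),
                 `2019` = c(10, 12),
                 `2020` = c(11, 14))

# tidy: each variable (country, year, value) has its own column
tidy <- untidy %>%
  pivot_longer(c(`2019`, `2020`), names_to = "year", values_to = "value")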
4. Importing data
4.1 readr and base R
As this workshop is more about the actual data cleaning and wrangling, I will only go over importing data very briefly.
Within tidyverse, the readr package was developed to provide a fast and friendly way to read rectangular data (e.g. csv). The readr functions you’re likely to use most often are:
- read_csv() to read comma (,) delimited files;
- read_csv2() to read semicolon (;) separated files (a common file type in Belgium, where , is used as the decimal point);
- read_tsv() to read tab delimited files;
- read_delim() to read files with any delimiter.
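In practice, such a call is a one-liner (a minimal sketch; the file names are hypothetical placeholders):

# read a comma-separated file; column types are guessed and reported
df <- read_csv("my_data.csv")
# read a file with a custom delimiter, e.g. the pipe character
df <- read_delim("my_data.txt", delim = "|")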
As you may already know, base R also has tools to import data (e.g. read.csv(), read.csv2(), read.table(), …), but these are generally slower than those provided by readr. Regardless, this is unlikely to be an issue unless you are working with larger datasets (> 1 million observations).
4.2 Big data
In case your workflow is suffering from large data files, fear not - there are many powerful tools at your disposal! Enter vroom and data.table. Both packages use multithreading, which is very beneficial if your computer has multiple CPU cores (often the case nowadays). This, along with some other nifty features, allows reading and writing data very fast (one could say vrrroooom). As opposed to data.table, vroom does not fully read data into memory, but only indexes it, meaning that only the columns and rows you actually use are read.
That said, data.table still tends to be faster than vroom for numeric data. On top of that, data.table provides tools and syntax to wrangle data much more efficiently than e.g. the tidyverse functions; it is faster and more memory-efficient in doing so, but its syntax is more difficult to read and write. As a compromise, dtplyr was called into existence within the tidyverse: it uses the same ‘tidy verbs’ you’ll become familiar with, but translates them into data.table syntax to benefit from its sheer speed (with some minor overhead, and loss of memory-efficiency). One additional thing I want to mention about data.table is its very convenient and fast way of reading (fread) and writing (fwrite) data. fread() is highly similar to base R’s read.table(), but automatically detects column separators and data types, and has many arguments to customize the function call.
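A quick taste of both functions (a minimal sketch, assuming data.table is installed; the file names are hypothetical placeholders):

# fread() detects the separator, header, and column types automatically
library(data.table)
dt <- fread("big_file.csv")
# fwrite() writes back to disk using multiple threads
fwrite(dt, "big_file_copy.csv")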
Nevertheless, as of readr 2.0.0, the package uses vroom as a backend, granting an impressive speed boost. Below is a benchmark comparing base R’s read.csv(), the previous (*.old) and current (*.new) versions of read_csv(), and data.table::fread(), importing a .csv file with 16 columns and > 34 million rows of mostly numerical data (total size: 3.56 GB, uncompressed).
Unit: seconds
expr min lq mean median uq max neval
base.read.csv() 111.372018 113.808849 114.516900 113.859472 116.155927 117.388234 5
readr.read_csv.old() 37.959150 38.016099 38.741874 38.960399 39.156582 39.617139 5
readr.read_csv.new() 4.326258 4.599566 4.941215 4.649826 4.681539 6.448888 5
dt.fread()              3.465170   3.667079   3.716254   3.742631   3.815739   3.890653     5
There are ways to optimize functions (such as read.table()) to handle data more efficiently, or to parallelize operations, but these are very situational and far beyond the scope of this workshop.
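The table above follows the output format of the microbenchmark package; a comparable run could look like the sketch below (assuming microbenchmark is installed and a suitably large, hypothetical big_file.csv exists on disk):

# time each import function 5 times on the same file
library(microbenchmark)
microbenchmark(
  base.read.csv  = read.csv("big_file.csv"),
  readr.read_csv = readr::read_csv("big_file.csv"),
  dt.fread       = data.table::fread("big_file.csv"),
  times = 5, unit = "s"
)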
To find out more about vroom and data.table, click on the embedded links.
Session Info
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=Dutch_Belgium.1252 LC_CTYPE=Dutch_Belgium.1252
[3] LC_MONETARY=Dutch_Belgium.1252 LC_NUMERIC=C
[5] LC_TIME=Dutch_Belgium.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
[5] readr_2.0.1 tidyr_1.1.3 tibble_3.1.4 ggplot2_3.3.5
[9] tidyverse_1.3.1 rmdformats_1.0.2 knitr_1.33
loaded via a namespace (and not attached):
[1] tidyselect_1.1.1 xfun_0.25 bslib_0.2.5.1 haven_2.4.3
[5] colorspace_2.0-2 vctrs_0.3.8 generics_0.1.0 htmltools_0.5.2
[9] yaml_2.2.1 utf8_1.2.2 rlang_0.4.11 jquerylib_0.1.4
[13] pillar_1.6.2 withr_2.4.2 glue_1.4.2 DBI_1.1.1
[17] dbplyr_2.1.1 modelr_0.1.8 readxl_1.3.1 lifecycle_1.0.0
[21] munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_1.0.1
[25] evaluate_0.14 tzdb_0.1.2 fastmap_1.1.0 fansi_0.5.0
[29] highr_0.9 tufte_0.10 broom_0.7.9 Rcpp_1.0.7
[33] backports_1.2.1 scales_1.1.1 jsonlite_1.7.2 fs_1.5.0
[37] hms_1.1.0 digest_0.6.27 stringi_1.7.4 bookdown_0.23
[41] grid_4.1.1 cli_3.0.1 tools_4.1.1 magrittr_2.0.1
[45] sass_0.4.0 crayon_1.4.1 pkgconfig_2.0.3 ellipsis_0.3.2
[49] xml2_1.3.2 reprex_2.0.1 lubridate_1.7.10 rstudioapi_0.13
[53] assertthat_0.2.1 rmarkdown_2.10 httr_1.4.2 R6_2.5.1
[57] compiler_4.1.1